Automatically select packing ratio #622

Closed
irenedea wants to merge 9 commits

Conversation

irenedea (Contributor) commented on Sep 22, 2023

Manual test

finetune-auto-pack-BAUz9w: https://wandb.ai/mosaic-ml/irene-test/runs/e3pc1puh

  • 54 batches; packing ratio is ~11

finetune-auto-pack-baseline-lvmog9: https://wandb.ai/mosaic-ml/irene-test/runs/vdxwzlxg

  • 618 batches (618 / ~11 ≈ 56, roughly consistent with the 54 packed batches above)

collate_fn=collate_fn,
batch_size=dataloader_batch_size,
drop_last=cfg.drop_last,
# sampler=dist.get_sampler(dataset, # TODO why was this not used in the first return in the original code?
irenedea (author):

TODO: add the sampler back in.
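
A minimal sketch of what restoring the sampler might look like, assuming Composer's dist.get_sampler helper and the dataset, collate_fn, dataloader_batch_size, and cfg objects already in scope in this builder (cfg.dataset.shuffle is an assumption about the config):

from composer.utils import dist
from torch.utils.data import DataLoader

# Restore the distributed sampler so each rank reads a distinct shard of the dataset.
dl = DataLoader(
    dataset,
    collate_fn=collate_fn,
    batch_size=dataloader_batch_size,
    drop_last=cfg.drop_last,
    sampler=dist.get_sampler(dataset,
                             drop_last=cfg.drop_last,
                             shuffle=cfg.dataset.shuffle),
)

Note that this only applies if the packed dataset stays map-style; a sampler cannot be passed for an IterableDataset.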

for _, leftover in self.collator._leftover_bins:
    yield leftover

class BinPackCollator:
irenedea (author):

TODO: remove this class altogether in favor of BinPackDataset; the logic from __call__ should be moved to the BinPackDataset class.
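
A rough sketch of the direction this TODO points in: an IterableDataset that packs on the fly instead of a collator. Only the BinPackDataset name comes from this comment; the greedy first-fit logic, the max_seq_len argument, and the _concat helper are illustrative assumptions.

import torch
from torch.utils.data import IterableDataset

class BinPackDataset(IterableDataset):
    """Greedily packs examples from `dataset` into bins of at most `max_seq_len` tokens."""

    def __init__(self, dataset, max_seq_len: int):
        self.dataset = dataset
        self.max_seq_len = max_seq_len

    def __iter__(self):
        bins = []  # list of (num_tokens, packed_example)
        for example in self.dataset:
            size = len(example['input_ids'])
            # First fit: append to the first bin with enough room, else open a new bin.
            for i, (bin_size, packed) in enumerate(bins):
                if bin_size + size <= self.max_seq_len:
                    bins[i] = (bin_size + size, _concat(packed, example))
                    break
            else:
                bins.append((size, example))
            # Emit bins that are exactly full.
            for _, packed in [b for b in bins if b[0] == self.max_seq_len]:
                yield packed
            bins = [b for b in bins if b[0] < self.max_seq_len]
        # Emit partially filled bins at the end, mirroring the _leftover_bins loop above.
        for _, leftover in bins:
            yield leftover

def _concat(a: dict, b: dict) -> dict:
    """Concatenate two unpadded examples key by key (illustrative helper)."""
    return {k: torch.cat([torch.as_tensor(a[k]), torch.as_tensor(b[k])]) for k in a}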

'attention_mask',
'bidirectional_mask',
]

# Cut everything down to size
irenedea (author):

TODO: remove this comment.

size, trimmed_example = extract_trim_batch_idx(batch, idx)
sizes.append(size)
trimmed_examples.append(trimmed_example)
sizes = [len(example['input_ids']) for example in examples]
irenedea (author):

Can we assume that we no longer need to trim examples if we pack at the dataset level?

Are datasets always unpadded?
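
If unpadded inputs cannot be guaranteed, a defensive size computation could trim via the attention mask instead of assuming len(input_ids) is the true token count. A small sketch (the key names follow the diff; everything else is illustrative):

import torch

def example_size(example: dict) -> int:
    """Token count of one example, ignoring padding if an attention mask is present."""
    if 'attention_mask' in example:
        return int(torch.as_tensor(example['attention_mask']).sum().item())
    return len(example['input_ids'])

sizes = [example_size(example) for example in examples]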

Comment on lines +135 to +137
# if k == 'sequence_id':
#     example[k] = torch.cat(
#         [example[k], add_on[k] + 1 + torch.max(example[k])])
irenedea (author):

TODO: add this back in.
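
For context, the commented-out lines shift the appended example's sequence_id past the current maximum so each original sequence keeps a distinct id inside the packed example. A standalone illustration of that behavior (the tensors are made up; the torch.cat expression matches the diff):

import torch

example = {'sequence_id': torch.tensor([0, 0, 0, 1, 1])}  # two sequences already packed
add_on = {'sequence_id': torch.tensor([0, 0])}            # one new sequence being appended

# Offset the incoming ids by (current max + 1) so ids stay unique per packed sequence.
example['sequence_id'] = torch.cat(
    [example['sequence_id'], add_on['sequence_id'] + 1 + torch.max(example['sequence_id'])])

assert example['sequence_id'].tolist() == [0, 0, 0, 1, 1, 2, 2]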

Comment on lines +251 to +264
# min_ratio = 2
# max_ratio = 2
# num_packing_ratios = 1
# profiling_results = profile_packing(dataloader_cfg, tokenizer, min_ratio,
#                                     max_ratio, num_packing_ratios,
#                                     device_batch_size)

# # Obtain the maximum packing_ratio/minimum padding that has no waste.
# i = 0
# waste = 0
# packing_ratio = 1
# while i < len(profiling_results) and waste == 0:
#     packing_ratio, _, waste = profiling_results[i]
#     i += 1
irenedea (author):

Uncomment and update the min/max ratios as appropriate.

I'm probably going to go for something like max_ratio = max_seq_len / 100 and num_packing_ratios = 15.
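
A sketch of what the uncommented selection could look like with those values. The profile_packing call and the (packing_ratio, padding, waste) result format follow the commented-out block above; the min_ratio of 1 and the assumption that results are ordered by increasing packing ratio are mine, and this version stops before adopting a ratio that shows waste (the literal commented loop would keep the first wasteful ratio):

# Profile candidate ratios from 1 up to max_seq_len / 100.
min_ratio = 1
max_ratio = max_seq_len / 100
num_packing_ratios = 15
profiling_results = profile_packing(dataloader_cfg, tokenizer, min_ratio,
                                    max_ratio, num_packing_ratios,
                                    device_batch_size)

# Pick the largest packing_ratio (least padding) that produces no waste.
packing_ratio = 1
for candidate_ratio, _, waste in profiling_results:
    if waste > 0:
        break
    packing_ratio = candidate_ratio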

return batches

def profile(raw_batch_size: int) -> Tuple[float, float]:
    packer = BinPackCollator(
irenedea (author):

TODO: replace this with BinPackDataset.

assert packed_samples[1] == [7] * 7
assert packed_samples[2] == [6] * 6

# def test_auto_packing():
irenedea (author):

TODO: add a test for the full auto packing flow.
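
A sketch of what such a test could check, exercising the zero-waste selection rule from the commented-out block above against hand-made profiling results. The helper that will own this logic is still TBD in this PR, so the selection loop is inlined here:

def test_auto_packing_selects_max_zero_waste_ratio():
    # Fake results in the (packing_ratio, padding, waste) format used by profile_packing.
    profiling_results = [(1.0, 0.50, 0.0), (2.0, 0.25, 0.0), (4.0, 0.10, 0.2)]

    packing_ratio = 1
    for candidate_ratio, _, waste in profiling_results:
        if waste > 0:
            break
        packing_ratio = candidate_ratio

    # The largest ratio with zero waste is 2.0.
    assert packing_ratio == 2.0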

irenedea (author):

Closing because we decided to stick with the collator version. This may cause small amounts of waste in practice, but it is much simpler to implement and maintain.

irenedea closed this on Oct 19, 2023.